Crawling - part II 
Coverage


Good coverage is obtained by carefully selecting seed URLs and 
using a good page selection policy to decide what to crawl next. 


Breadth-first search is adequate when you have simple needs, but 
many techniques outperform it. It particularly helps to have an existing 
index from a previous crawl. 

Coverage Goals


The Internet is too large and changes too rapidly for any crawler to be able
to crawl and index it all. Instead, a crawler should focus on strategic
crawling to balance coverage and freshness.

A crawler should prioritize crawling high-quality content to better answer
user queries. The Internet contains a lot of spam, redundant information,
and pages which aren't likely to be relevant to users' information needs.

def crawl(seeds):

    # Initialize the frontier with the seed URLs
    frontier.add_pages(seeds)

    # Iteratively crawl the next item in the frontier
    while not frontier.is_empty():

        # Crawl the next URL and extract anchor tags from it
        url = frontier.choose_next()
        page = crawl_url(url)
        urls = parse_page(page)

        # Update the frontier and send the page to the indexer
        frontier.add_pages(urls)
        send_to_indexer(page)

Basic Crawler Algorithm


Selection Policies 


A selection policy is an algorithm used to select the next page to crawl. Standard 
approaches include: 


* Breadth-first search: This distributes requests across domains relatively well and
tends to download high-PageRank pages early.

* Backlink count: Prioritize pages with more in-links from already-crawled pages.

* Larger sites first: Prioritize pages on domains with many pages in the frontier.

* Partial PageRank: Approximate PageRank scores are calculated based on
already-crawled pages.


There are also approaches which estimate page quality based on a prior crawl. 
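
As an illustration of one of these policies, here is a minimal sketch of a frontier that orders URLs by backlink count. The class name `BacklinkFrontier` and its methods are hypothetical names chosen for this example, and a real crawler would also need politeness and persistence.

```python
import heapq
from collections import defaultdict


class BacklinkFrontier:
    """Frontier that returns the uncrawled URL with the most known in-links."""

    def __init__(self):
        self.inlinks = defaultdict(int)  # URL -> in-links seen so far
        self.crawled = set()
        self.heap = []  # (-count, url) entries; stale ones are skipped lazily

    def add_link(self, url):
        """Record one in-link to url, found on an already-crawled page."""
        if url in self.crawled:
            return
        self.inlinks[url] += 1
        heapq.heappush(self.heap, (-self.inlinks[url], url))

    def _drop_stale(self):
        # An entry is stale if its URL was crawled or its count has since grown.
        while self.heap:
            neg_count, url = self.heap[0]
            if url in self.crawled or -neg_count != self.inlinks[url]:
                heapq.heappop(self.heap)
            else:
                break

    def is_empty(self):
        self._drop_stale()
        return not self.heap

    def choose_next(self):
        self._drop_stale()
        _, url = heapq.heappop(self.heap)
        self.crawled.add(url)
        return url
```

Ties are broken alphabetically by URL here, only because of how tuples compare; any other tie-breaking rule would serve equally well.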


Comparing Approaches 


Baeza-Yates et al. compare these approaches to find out what fraction of the
high-quality pages in a collection is crawled by each strategy at various
points in a crawl.

Breadth-first search does relatively poorly. Larger sites first is among the
best approaches, along with "historical" approaches which take PageRank
scores from a prior crawl into account.

OPIC, a fast approximation to PageRank which can be calculated on the fly, is
another good choice. The "omniscient" baseline always fetches the
highest-PageRank page in the frontier.

[Figures: results for two crawls, "Chile, May 2004" and "Greece, May 2004"]

Ricardo Baeza-Yates, Carlos Castillo, Mauricio Marin, and Andrea Rodriguez. 2005. Crawling
a country: better strategies than breadth-first for web page ordering.


Obtaining Seed URLs 


It's important to choose the right sites to initialize your frontier. A simple
approach is to start with the sites in an Internet directory, such as
http://www.dmoz.org.

In general, good hubs tend to lead to many high-quality web pages. These
hubs can be identified with a careful analysis of a prior crawl.

[Figure: the dmoz directory home page, listing top-level categories such as Games, Health, Home, Kids and Teens, News, Recreation, Reference, Regional, Science, Shopping, Society, and Sports]

http://www.dmoz.org


The Deep Web 


Despite these techniques, a substantial fraction of web pages remains uncrawled and
unindexed by search engines. These pages are known as "the deep web."


These pages are missed for many reasons:


* Dynamically-generated pages, such as pages that make heavy use of AJAX, rely on
web browser behavior and are missed by a straightforward crawl.


* Many pages reside on private web sites and are protected by passwords.


* Some pages are intentionally hidden, using robots.txt or more sophisticated approaches
such as "darknet" software.


Special crawling and indexing techniques are used to attempt to index this content, such
as rendering pages in a browser during the crawl.


Freshness 


The web is constantly changing, and re-crawling the latest changes 
quickly can be challenging. 


It turns out that aggressively re-crawling as soon as a page changes is
sometimes the wrong approach: it's better to use a cost function
associated with the expected age of the content, and tolerate a small
delay between re-crawls.


Page Freshness


The web is constantly changing as content is added, deleted, and
modified. In order for a crawler to reflect the web as users will
encounter it, it needs to recrawl content soon after it changes.


This need for freshness is key to providing a good search engine 
experience. For instance, when breaking news develops, users will 
rely on your search engine to stay updated. 


It's also important to refresh less time-sensitive documents so the
results list doesn't contain spurious links to deleted or modified data.


HTTP HEAD Requests


A crawler can determine whether a page has changed by making an
HTTP HEAD request.


The response provides the HTTP status code and headers, but not the
document body. The headers include information about when the content
was last updated.


However, it's not feasible to constantly send HEAD requests, so this isn't an
adequate strategy for freshness.


Request

HEAD /csinfo/people.html HTTP/1.1
Host: www.cs.umass.edu

Response

HTTP/1.1 200 OK
Date: Thu, 03 Apr 2008 05:17:54 GMT
Server: Apache/2.0.52 (CentOS)
Last-Modified: Fri, 04 Jan 2008 15:28:39 GMT
ETag: "239c33-2576-2a2837c0"
Accept-Ranges: bytes
Content-Length: 9590
Connection: close
Content-Type: text/html; charset=ISO-8859-1
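
A crawler might use these headers roughly as follows. This is a minimal sketch, assuming the crawler saved the ETag and the crawl date at its previous visit; the function name and its parameters are hypothetical, and header names are looked up with the exact capitalization shown above.

```python
from email.utils import parsedate_to_datetime


def changed_since_crawl(head_headers, stored_etag=None, last_crawl_date=None):
    """Guess from HEAD response headers whether a page changed since the last crawl.

    head_headers:    dict of header name -> value from the HEAD response
    stored_etag:     ETag saved at the previous crawl (assumed crawler state)
    last_crawl_date: HTTP-date string saved at the previous crawl
    """
    etag = head_headers.get("ETag")
    if etag is not None and stored_etag is not None:
        return etag != stored_etag

    last_modified = head_headers.get("Last-Modified")
    if last_modified is not None and last_crawl_date is not None:
        return (parsedate_to_datetime(last_modified)
                > parsedate_to_datetime(last_crawl_date))

    # Without either validator we cannot tell, so assume the page may have changed.
    return True
```

With the response above, a crawler that last fetched the page on 03 Apr 2008 would conclude the page is unchanged, since Last-Modified is three months earlier.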


Freshness vs. Age 


It turns out that optimizing to maximize freshness is a poor strategy: it can
lead the crawler to ignore important sites. A page that changes very often is
almost never fresh, so a crawler that only counts fresh pages stops visiting it.

Instead, it's better to re-crawl pages when the age of the last crawled
version exceeds some limit. The age of a page is the elapsed time since
the first update after the most recent crawl.

[Figure: freshness and age plotted over a timeline of crawls and page updates]

Freshness is binary, age is continuous.


Expected Page Age 


The expected age of a page t days after it was crawled depends on its
update probability:

\[ \mathrm{age}(\lambda, t) = \int_0^t P(\text{page changed at time } x)\,(t - x)\,dx \]

On average, page updates follow a Poisson distribution, so the time until
the next update is governed by an exponential distribution. This makes
the expected age:

\[ \mathrm{age}(\lambda, t) = \int_0^t \lambda e^{-\lambda x}\,(t - x)\,dx \]
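
Integrating by parts gives the closed form age(λ, t) = t - (1 - e^{-λt}) / λ. The sketch below evaluates that closed form and checks it against a simple numerical integration. The function names are chosen for this example only.

```python
import math


def expected_age(lam, t):
    """Closed form of the expected-age integral: t - (1 - exp(-lam*t)) / lam."""
    return t - (1.0 - math.exp(-lam * t)) / lam


def expected_age_numeric(lam, t, steps=100_000):
    """Midpoint-rule approximation of the same integral, as a sanity check."""
    dx = t / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * dx
        total += lam * math.exp(-lam * x) * (t - x) * dx
    return total


if __name__ == "__main__":
    # A page updated about once a week, crawled 7 days ago.
    print(expected_age(1 / 7, 7))          # equals 7/e, roughly 2.6 days
    print(expected_age_numeric(1 / 7, 7))  # should agree closely
```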


Cost of Not Re-crawling 


The cost of not re-crawling a page grows at an increasing rate in the time
since the last crawl: the expected age curve bends upward. For instance, with
page update frequency λ = 1/7 per day:

[Figure: expected age (in days) plotted against days elapsed since the last crawl]


Freshness vs. Coverage 


The opposing needs of Freshness and Coverage need to be balanced 
in the scoring function used to select the next page to crawl. 


Finding an optimal balance is still an open question. Fairly recent 
studies have shown that even large name-brand search engines only 
do a modest job at finding the most recent content. 


However, a reasonable approach is to include a term in the page
priority function for the expected age of the page content. For
important domains, you can track the site-wide update frequency λ.
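
One possible reading of "include a term for the expected age" is a weighted sum of a quality score and the expected age. The sketch below is only an illustration under that assumption: `crawl_priority`, `quality`, and `age_weight` are hypothetical, and the right weighting is exactly the open question described above.

```python
import math


def expected_age(lam, t):
    """Expected age t days after a crawl, for a page with update rate lam per day."""
    return t - (1.0 - math.exp(-lam * t)) / lam


def crawl_priority(quality, lam, days_since_crawl, age_weight=0.5):
    """Hypothetical priority: page quality plus a weighted expected-age term."""
    return quality + age_weight * expected_age(lam, days_since_crawl)
```

Under this scoring, two pages of equal quality are ordered by how stale they are expected to be, so a frequently updated page overtakes a rarely updated one crawled on the same day.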


[Table: for each technique, the original marks the objectives it targets (Coverage, Freshness) and the factors it considers (Importance, Relevance, Dynamicity)]

Breadth-first search [43, 95, 108]
Prioritize by indegree [43]
Prioritize by PageRank [43, 45]
Prioritize by site size [9]
Prioritize by spawning rate [48]
Prioritize by search impact [104]
Scoped crawling (Section 4.2)

Minimize obsolescence [41, 46]
Minimize age [41]
Minimize incorrect content [99]
Minimize embarrassment [115]
Maximize search impact [103]
Update capture (Section 5.2)

WebFountain [56]
OPIC [1]
I 


3.2 Taxonomy of crawl ordering techniques. 


Pitfalls of Crawling 


A breadth-first search implementation of crawling is not sufficient for
coverage, freshness, spam avoidance, or other needs of a real
crawler.


Scaling the crawler up takes careful engineering, and often detailed
systems knowledge of the hardware architecture you're developing for.


Crawling at Scale


A commercial crawler should support thousands of HTTP requests per second. If the 
crawler is distributed, that applies for each node. Achieving this requires careful 
engineering of each component. 


* DNS resolution can quickly become a bottleneck, particularly because sites often
have URLs with many subdomains at a single IP address.


* The frontier can grow extremely rapidly: hundreds of thousands of URLs per second
are not uncommon. Managing the filtering and prioritization of URLs is a challenge.


* Spam and malicious web sites must be addressed, lest they overwhelm the frontier
and waste your crawling resources. For instance, some sites respond to crawlers by
intentionally adding seconds of latency to each HTTP response. Other sites respond
with data crafted to confuse, crash, or mislead a crawler.


Duplicate URL Detection at Scale 


Lee et al.'s DRUM algorithm gives a sense of the requirements of
large-scale de-duplication.


It manages a collection of tuples of keys (hashed URLs), values (arbitrary
data, such as quality scores), and aux data (URLs). It supports the
following operations:


* check - Does a key exist? If so, fetch its value.


* update - Merge new tuples into the repository.


* check+update - Check and update in a single pass.


[Figure: incoming <key, value, aux> tuples are split into k pairs of <key, value> buffers and aux buffers in RAM, which are flushed to disk]


Data flow for DRUM: A tiered system of buffers in RAM and on disk is
used to support large-scale operations.
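
To make the three operations concrete, here is a toy, purely in-memory sketch. It is not the DRUM implementation: the real system batches tuples through disk-backed buffers, while this example uses a Python dict as a stand-in for the repository. The names `ToyDrum` and `url_key` are hypothetical.

```python
import hashlib


def url_key(url):
    """Hash a URL to a 64-bit integer key."""
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ToyDrum:
    """In-memory stand-in for a DRUM repository of key -> value."""

    def __init__(self):
        self.repository = {}

    def check(self, key):
        """Return the stored value for key, or None if the key is unknown."""
        return self.repository.get(key)

    def update(self, tuples):
        """Merge (key, value, aux) tuples into the repository."""
        for key, value, _aux in tuples:
            self.repository[key] = value

    def check_update(self, tuples):
        """Process a batch in key order; return aux data for keys not seen before."""
        unique = []
        for key, value, aux in sorted(tuples, key=lambda t: t[0]):
            if key not in self.repository:
                self.repository[key] = value
                unique.append(aux)
        return unique
```

Sorting the batch by key mirrors the idea that a batch can be merged against a sorted repository in one sequential pass, which is what makes the disk-based version efficient.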


IRLBot Operation 


DRUM is used as storage for the IRLBot crawler. A new URL passes
through the following steps.


1. The URLseen DRUM checks whether the URL has already been fetched.


2. If not, two budget checks filter out spam links (discussed next).


3. Next, we check whether the URL passes its robots.txt. If necessary, we
fetch robots.txt from the server.


4. Finally, the URL is passed to the queue to be crawled by the next
available thread.


IRLBot Architecture 
[Figure: URLs found by the crawling threads pass through (1) a uniqueness check against the URLseen DRUM, (2) a spam check made up of the STAR budget check and BEAST budget enforcement, (3) a robots.txt check using the RobotsCache and RobotsRequested DRUMs, with a queue for robots.txt downloads, and (4) the ready queue, from which URLs are sent to crawlers]


Link Spam 


The web is full of link farms and other forms of link spam, generally posted by 
people trying to manipulate page quality measures such as PageRank. 


These links waste a crawler's resources, and detecting and avoiding them is
important for correct page quality calculations.


One way to mitigate this, implemented in IRLBot, is based on the observation
that spam servers tend to have very large numbers of pages linking to each
other.


IRLBot assigns a budget to each domain based on the number of in-links from
other domains. The crawler de-prioritizes links from domains which have
exceeded their budget, so link-filled spam domains are largely ignored.
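
The budget idea can be sketched as follows. This is a simplified illustration, not IRLBot's actual algorithm: the budget formula (`base` plus `per_inlink` for each distinct linking domain) is an assumption made for the example, and the host name is used as a stand-in for the domain.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def domain_of(url):
    """Use the host name as a stand-in for the domain (a simplification)."""
    return urlsplit(url).hostname or ""


def compute_budgets(links, base=1, per_inlink=2):
    """links: iterable of (source_url, target_url) pairs from crawled pages.

    A domain's budget grows with the number of distinct *other* domains
    linking to it; links within a domain earn nothing.
    """
    linking_domains = defaultdict(set)
    for src, dst in links:
        s, d = domain_of(src), domain_of(dst)
        if s != d:
            linking_domains[d].add(s)
    budgets = {d: base + per_inlink * len(srcs) for d, srcs in linking_domains.items()}
    return defaultdict(lambda: base, budgets)


def split_by_budget(candidate_urls, budgets):
    """Admit URLs until each domain's budget is used up; defer the rest."""
    used = defaultdict(int)
    admitted, deferred = [], []
    for url in candidate_urls:
        d = domain_of(url)
        if used[d] < budgets[d]:
            used[d] += 1
            admitted.append(url)
        else:
            deferred.append(url)
    return admitted, deferred
```

A link farm whose pages only link to each other never earns more than the base budget, so most of its URLs land in the deferred list.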


Spider Traps 


A spider trap is a collection of web 
pages which, intentionally or not, 
provide an infinite space of URLs to 
crawl. 


Some site administrators place spider traps on their sites in order to trap
or crash spambots, or defend against malicious bandwidth-consuming
scripts.


A common example of a benign spider trap is a calendar which links
continually to the next year.


[Figure: a "Calendar for year 2116 (United States)" page with links to 2115, 2116, and 2117]


A benign spider trap on
http://www.timeanddate.com


Avoiding Spider Traps 


The first defense against spider traps is to have a good politeness policy,
and always follow it.

* By avoiding frequent requests to the same domain, you reduce the
possible damage a trap can do.

* Most sites with spider traps provide instructions for avoiding them in
robots.txt.

[...]
User-agent: *
Disallow: /createshort.htm
Disallow: /scripts/savecustom.php
Disallow: /scripts/wquery.php
Disallow: /scripts/tza.php
Disallow: /scripts/savepersonal.php
Disallow: /information/mk/
Disallow: /information/feedback-save.php
Disallow: /information/feedback.html?
Disallow: /eclipse/in/*?iso
Disallow: /custom/save.php
Disallow: /calendar//index.html
Disallow: /calendar//monthly.html
Disallow: /calendar//custom.html
Disallow: /counters//newyeara.html
Disallow: /counters//worldfirst.html
[...]

From http://www.timeanddate.com/robots.txt
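
Python's standard library can apply rules like these. The sketch below feeds a few of the plain-path rules from the excerpt to `urllib.robotparser`. The crawler name `ExampleCrawler` and the helper `make_robots_checker` are made up for this example, and the standard-library parser matches path prefixes only, so wildcard rules are left out.

```python
from urllib.robotparser import RobotFileParser

# A subset of the rules shown above (plain path prefixes only).
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /scripts/wquery.php",
    "Disallow: /information/mk/",
    "Disallow: /custom/save.php",
]


def make_robots_checker(lines, user_agent="ExampleCrawler"):
    """Return a function telling whether user_agent may fetch a given URL."""
    parser = RobotFileParser()
    parser.parse(lines)

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed


if __name__ == "__main__":
    allowed = make_robots_checker(ROBOTS_LINES)
    print(allowed("http://www.timeanddate.com/scripts/wquery.php"))
    print(allowed("http://www.timeanddate.com/worldclock/"))
```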


Storing Crawled Content


We need to normalize and store the contents of web documents so they 
can be indexed, so snippets can be generated, and so on. 


Online documents have many formats and encoding schemes. There are 
hundreds of character encoding systems we haven't mentioned here. 


A good document storage system should support efficient random
access for lookups, updates, and content retrieval. Often, a distributed
storage system like BigTable is used.


Content Conversion


Downloaded page content generally 
needs to be converted into a stream of 
tokens before it can be indexed. 


Content arrives in hundreds of
incompatible formats: Word documents,
PowerPoint, RTF, OTF, PDF, etc.
Conversion tools are generally used to
transform them into HTML or XML.


Depending on your needs, the crawler 
may store the raw document content 
and/or normalized content output from a 
converter. 


[Figure: HTML, PDF, and RTF documents; the non-HTML formats are converted to HTML, and everything is stored in the Document Repository]


Character Encodings 


Crawled content will be represented with many different character
encodings, which can easily confuse text processors.

A character encoding is a map from bits in a file to glyphs on a screen. In
English, the basic encoding is ASCII.

ASCII uses 8 bits: 7 bits to represent letters, numbers, punctuation, and
control characters, and an extra bit for padding.

[Figure: USASCII code chart. Image courtesy Wikipedia]


Unicode 


The various Unicode encodings were invented to support a broader range of
characters. Unicode is a single mapping from numbers to glyphs, with various
encoding schemes of different sizes.

* UTF-8 uses one byte for ASCII characters, and more bytes for
extended characters. It's often preferred for file storage.

* UTF-32 uses four bytes for every character, and is more convenient for
use in memory.

Glyph   ASCII   UTF-8                 UTF-32
A       0x41    0x41                  0x00000041
&       0x26    0x26                  0x00000026
π       N/A     0xCF 0x80             0x000003C0
👍      N/A     0xF0 0x9F 0x91 0x8D   0x0001F44D


UTF-8


UTF-8 uses a variable-length encoding scheme.


If the most significant (leftmost) bit of a given byte is set, the character
takes another byte.


The first 128 numbers are the same as ASCII, so any ASCII document could be
said to (retroactively) use UTF-8.


UTF-8 is designed to minimize disk space for documents in many languages, but
UTF-32 is faster to decode and easier to use in memory.


UTF-8 Encoding Scheme


Decimal          Hexadecimal     Encoding
0-127            0-7F            0xxxxxxx
128-2047         80-7FF          110xxxxx 10xxxxxx
2048-55295       800-D7FF        1110xxxx 10xxxxxx 10xxxxxx
55296-57343      D800-DFFF       Undefined
57344-65535      E000-FFFF       1110xxxx 10xxxxxx 10xxxxxx
65536-1114111    10000-10FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
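
The encoding scheme can be checked with a short hand-written encoder. This is a sketch for illustration (the function name `utf8_encode` is chosen for this example); in practice you would simply call Python's built-in `str.encode("utf-8")`, which the sketch is compared against.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point by hand, following the UTF-8 bit layout."""
    if code_point < 0x80:
        return bytes([code_point])
    if code_point < 0x800:
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points have no UTF-8 encoding")
    if code_point < 0x10000:
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of range")


if __name__ == "__main__":
    for ch in "A&\u03c0\U0001F44D":
        encoded = utf8_encode(ord(ch))
        # The hand-written encoder should agree with the built-in codec.
        print(hex(ord(ch)), encoded.hex(), encoded == ch.encode("utf-8"))
```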


Document Repositories


What do we need from our document repository? 


* Fast random access - need to store and obtain documents by their URLs (or a hash of 
the URL) 


* Fast document updates - need to associate and update metadata with documents, 
and replace (or append to) records when documents are re-crawled 


* Compressed storage - greatly reduces storage needs, and minimizes disk reads for 
access 


* Large file storage - multiple documents are stored in a single large file to reduce
filesystem overhead


Most companies use custom storage systems, or distributed systems like BigTable.


Large File Storage 


Placing millions or billions of web 
pages in individual files results in 
substantial filesystem overhead for 
opening, writing, and finding files. 


It's important to pack many documents into
larger files, generally with an indexing
scheme to give fast random access.


A simple index might store a B-tree
mapping document URL hash values
to the byte offset of the document
contents in the file.
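
The following sketch packs compressed documents into one file-like object and keeps an index from URL hash to byte offset. It is an illustration under simplifying assumptions: a Python dict stands in for the B-tree, the index lives only in memory, and the class name `PackedStore` is hypothetical.

```python
import hashlib
import io
import zlib


class PackedStore:
    """Append-only store of compressed documents inside a single file."""

    def __init__(self, fileobj):
        self.file = fileobj  # any seekable binary file object
        self.index = {}      # URL hash -> (offset, length); stand-in for a B-tree

    @staticmethod
    def _key(url):
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def put(self, url, content):
        """Compress content (bytes) and append it; re-crawls add a new record."""
        blob = zlib.compress(content)
        self.file.seek(0, io.SEEK_END)
        offset = self.file.tell()
        self.file.write(blob)
        self.index[self._key(url)] = (offset, len(blob))

    def get(self, url):
        """Seek straight to the document's offset and decompress it."""
        offset, length = self.index[self._key(url)]
        self.file.seek(offset)
        return zlib.decompress(self.file.read(length))
```

For a quick experiment, `PackedStore(io.BytesIO())` behaves the same way as a store backed by a file opened in `"w+b"` mode.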


TREC Web Format 


<DOC>
<DOCNO>WTX001-B01-10</DOCNO>
<DOCHDR>
http://www.example.com/test.html 204.244.59.33 19970101013145 text/html 440
HTTP/1.0 200 OK
Date: Wed, 01 Jan 1997 01:21:13 GMT
Server: Apache/1.0.3
Content-type: text/html
Content-length: 270
Last-modified: Mon, 25 Nov 1996 05:31:24 GMT
</DOCHDR>
<HTML>
<TITLE>Tropical Fish Store</TITLE>
Coming soon!
</HTML>
</DOC>
<DOC>
<DOCNO>WTX001-B01-109</DOCNO>
<DOCHDR>
http://www.example.com/fish.html 204.244.59.33 19970101013149 text/html 440
HTTP/1.0 200 OK
Date: Wed, 01 Jan 1997 01:21:19 GMT
Server: Apache/1.0.3
Content-type: text/html
Content-length: 270
Last-modified: Mon, 25 Nov 1996 05:31:24 GMT
</DOCHDR>
<HTML>
<TITLE>Fish Information</TITLE>
This page will soon contain interesting
information about tropical fish.
</HTML>
</DOC>
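
A file in this format can be split back into documents with a few regular expressions. This is a minimal sketch that assumes well-formed records like the two above, where the first token of the DOCHDR block is the URL; the function name `parse_trec_web` is chosen for this example.

```python
import re

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>(.*?)</DOCNO>", re.DOTALL)
DOCHDR_RE = re.compile(r"<DOCHDR>(.*?)</DOCHDR>", re.DOTALL)


def parse_trec_web(text):
    """Return a list of dicts with docno, url, and content for each <DOC> record."""
    docs = []
    for match in DOC_RE.finditer(text):
        body = match.group(1)
        docno = DOCNO_RE.search(body).group(1).strip()
        header_match = DOCHDR_RE.search(body)
        header = header_match.group(1).strip()
        url = header.split()[0]  # first token of the header line is the URL
        content = body[header_match.end():].strip()
        docs.append({"docno": docno, "url": url, "content": content})
    return docs
```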


Vertical Search 


Vertical Search depends on crawling a collection on the topic of 
interest. 


General search engines also use topical crawlers to improve their 
coverage for key topics. 


The main trick to topical crawling is finding topical pages which are 
only reachable by exploring off-topic pages through careful risk-taking. 


Vertical Search


Vertical Search engines focus on a 
particular domain of information. 


The primary difference between
vertical and general search engines is
the set of documents they crawl.
Vertical search engines typically use
what are known as topical crawlers.


[Figure: CiteSeer results page for the query "vertical search", showing results 1 - 10 of 119,317]


CiteSeer, Vertical Search for Research


Topical Crawlers 


Topical Crawlers focus on documents 
related to a particular topic of interest. 


These crawlers are useful for improving 
the collection quality of general search 
engines, too. Many search engines use 
a variety of topical crawlers to 
supplement their primary crawler. 


A basic approach uses a topical set of 
seed URLs and text classifiers to 
decide whether links appear to be on 
topic. 


def crawl(seeds):

    # High quality topical hubs are used as seeds
    frontier.add_pages(seeds)

    # Iteratively crawl the next item in the frontier
    while not frontier.is_empty():

        # Crawl the next URL and extract anchor tags from it
        url = frontier.choose_next()
        page = crawl_url(url)
        urls = parse_page(page)

        # The URLs are filtered to stay on topic
        urls = filter_by_topic(urls)

        # Update the frontier and send the page to the indexer
        frontier.add_pages(urls)
        send_to_indexer(page)


Basic Topical Crawler 


Text Classifiers 


Text classification is a machine learning task that we'll see later in the
course.

The idea is to use properties of the URL, anchor text, and document to
predict whether the URL links to a page on the topic of interest.

For example, we could use a unigram language model trained on anchor
text for topical links.


Classification with Language Models


1. Collect anchor text for links to topical and non-topical pages.

2. Train a unigram language model by producing smoothed probability
estimates of topicality for each term.

3. Classify new links using the odds ratio from training data for some
threshold λ:

\[ \prod_{w \in \text{text}} \frac{\Pr(w \mid \text{topic} = 1)}{\Pr(w \mid \text{topic} = 0)} > \lambda \]
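
The three steps can be sketched as below. The slides only say the estimates are smoothed; add-one (Laplace) smoothing is an assumption made here, as are the function names and the choice to ignore words outside the training vocabulary. The product of ratios is computed as a sum of logs to avoid underflow.

```python
import math
from collections import Counter


def train_unigram(anchor_texts, vocabulary):
    """Estimate Pr(w | class) for every vocabulary word, with add-one smoothing."""
    counts = Counter(w for text in anchor_texts for w in text.lower().split())
    total = sum(counts[w] for w in vocabulary)
    return {w: (counts[w] + 1) / (total + len(vocabulary)) for w in vocabulary}


def log_odds(text, topical_model, other_model):
    """Sum of log(Pr(w|topic=1) / Pr(w|topic=0)) over the words of the text."""
    score = 0.0
    for w in text.lower().split():
        if w in topical_model:  # words outside the vocabulary are ignored
            score += math.log(topical_model[w] / other_model[w])
    return score


def is_topical(text, topical_model, other_model, threshold=1.0):
    """Apply the odds-ratio test: product of ratios > threshold."""
    return log_odds(text, topical_model, other_model) > math.log(threshold)
```

A higher threshold makes the crawler more conservative, accepting only links whose anchor text looks strongly topical.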


Explore vs. Exploit Tradeoff


More sophisticated topical crawlers use machine learning techniques to
balance the tradeoff between exploring new territory and exploiting
links which are probably high-quality.

* Exploit-only strategies may miss high-quality pages which aren't
tightly linked to the seed set.

* Explore-only strategies will ignore high-quality pages we can easily
find.

[Figure: a link graph of pages labeled Good and Bad]


Sometimes bad links must be explored to find good links


Careful Exploration 


There are many ways to balance exploration and exploitation, and this topic is
actively researched for many applications. Here are some simple approaches for
topical crawling.


* Adjust the classification threshold to match your risk tolerance.


* Flip a biased coin to decide whether to visit a page which doesn't seem
promising.


* If using a document quality score such as PageRank, explore for a while without
updating quality scores. Links on crawled pages won't be taken into account, so
scores will be somewhat inaccurate and you will explore more.


There are more sophisticated approaches if maximizing performance is important. 
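
The biased-coin idea fits in a few lines. In this sketch, `topic_score`, `threshold`, and `explore_prob` are hypothetical parameters: links scoring above the threshold are always followed, and the rest are followed with a small fixed probability.

```python
import random


def should_visit(topic_score, threshold=1.0, explore_prob=0.1, rng=random):
    """Decide whether to crawl a link, exploring unpromising ones occasionally."""
    if topic_score > threshold:
        return True  # exploit: the link looks on-topic
    return rng.random() < explore_prob  # explore: flip a biased coin
```

Passing a seeded `random.Random` instance as `rng` makes a crawl reproducible, which helps when comparing different exploration rates.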


Crawling Structured Data 


In addition to the obvious content for human readers, the web contains 
a great deal of structured content for use in automated systems. 


e Document feeds are an important way to manage freshness at some 
of the most frequently-updated web sites. 


* Much of the structured data owned by various web entities is
published in a structured format. This can provide signals for
relevance, and can also aid in reconstructing structured databases.


Structured Web Data


In addition to unstructured document contents, a great deal of 
structured data exists on the web. We'll focus here on two types: 


e Document feeds, which sites use to announce their new content 


* Content metadata, used by web authors to publish structured
properties of objects on their site


Document Feeds 


Sites which post articles, such as blogs or news sites, typically offer a
listing of their new content in the form of a document feed.

Several common feed formats exist. One of the most popular is RSS, which
stands for (take your pick):

* Rich Site Summary

* Really Simple Syndication

* RDF Site Summary

[Figure: the CNN RSS page, listing feed URLs such as http://rss.cnn.com/rss/cnn_world.rss, http://rss.cnn.com/rss/cnn_us.rss, and http://rss.cnn.com/rss/cnn_crime.rss]

http://www.cnn.com/services/rss/


RSS Format 


RSS is an XML format for document listings.

RSS files are obtained just like web pages, with HTTP GET requests.

The ttl field provides an amount of time (in minutes) that the contents
should be cached.

RSS feeds are very useful for efficiently managing freshness of news
and blog content.

<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Search Engine News</title>
    <link>http://www.search-engine-news.org/</link>
    <description>News about search engines.</description>
    <language>en-us</language>
    <pubDate>Tue, 19 Jun 2008 05:17:00 GMT</pubDate>
    <ttl>60</ttl>
    <item>
      <title>Upcoming SIGIR Conference</title>
      <link>http://www.sigir.org/conference</link>
      <description>The annual SIGIR conference is coming!
        Mark your calendars and check for cheap
        flights.</description>
      <pubDate>Tue, 05 Jun 2008 09:50:11 GMT</pubDate>
      <guid>http://search-engine-news.org#500</guid>
    </item>
    <item>
      <title>New Search Engine Textbook</title>
      <link>http://www.cs.umass.edu/search-book</link>
      <description>A new textbook about search engines
        will be published soon.</description>
      <pubDate>Tue, 05 Jun 2008 09:33:01 GMT</pubDate>
      <guid>http://search-engine-news.org#499</guid>
    </item>
  </channel>
</rss>

RSS Example
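
Because RSS is plain XML, a crawler can read a feed with the standard library. The sketch below pulls out the channel's ttl and the items. It assumes an RSS 2.0 document shaped like the example, and the function name `parse_rss` is chosen for this illustration.

```python
import xml.etree.ElementTree as ET


def parse_rss(xml_text):
    """Extract the channel title, ttl (minutes), and items from an RSS 2.0 feed."""
    channel = ET.fromstring(xml_text).find("channel")
    ttl = channel.findtext("ttl")
    items = [
        {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "pubDate": item.findtext("pubDate"),
            "guid": item.findtext("guid"),
        }
        for item in channel.findall("item")
    ]
    return {
        "title": channel.findtext("title"),
        "ttl_minutes": int(ttl) if ttl else None,
        "items": items,
    }
```

A crawler could use `ttl_minutes` to schedule its next fetch of the feed, and compare each `guid` against those already seen to find new articles.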


Structured Data 


Many web pages are generated from 
structured data in databases, which 
can be useful for search engines and 
other crawled document collections. 


Several schemas exist for web authors 
to publish their structured data for 
these tools. 


The WHATWG web specification working group has produced several
standard formats for this data, such as microdata embedded in HTML.


<section itemscope itemtype="http://schema.org/Person">
  Hello, my name is
  <span itemprop="name">John Doe</span>,
  I am a
  <span itemprop="jobTitle">graduate research assistant</span>
  at the
  <span itemprop="affiliation">University of Dreams</span>.
  My friends call me
  <span itemprop="additionalName">Johnny</span>.
  You can visit my homepage at
  <a href="http://www.JohnnyD.com" itemprop="url">www.JohnnyD.com</a>.
  <section itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    I live at
    <span itemprop="streetAddress">1234 Peach Drive</span>,
    <span itemprop="addressLocality">Warner Robins</span>,
    <span itemprop="addressRegion">Georgia</span>.
  </section>
</section>
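
A crawler can collect these properties while parsing the page. The sketch below uses the standard library's `html.parser` to gather (itemprop, text) pairs. It is deliberately simplified: it flattens nested items into one list, skips elements that open a new item with `itemscope`, and would mishandle an element nested inside another element of the same tag name. The class name `ItempropExtractor` is hypothetical.

```python
from html.parser import HTMLParser


class ItempropExtractor(HTMLParser):
    """Collect (itemprop, text) pairs from microdata-annotated HTML."""

    def __init__(self):
        super().__init__()
        self.properties = []  # (name, text) pairs in document order
        self._open = []       # stack of [tag, name, text_parts] being captured

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Elements with itemscope start a nested item; only leaf values are captured.
        if "itemprop" in attrs and "itemscope" not in attrs:
            self._open.append([tag, attrs["itemprop"], []])

    def handle_data(self, data):
        if self._open:
            self._open[-1][2].append(data)

    def handle_endtag(self, tag):
        if self._open and self._open[-1][0] == tag:
            _, name, parts = self._open.pop()
            self.properties.append((name, "".join(parts).strip()))


def extract_itemprops(html):
    parser = ItempropExtractor()
    parser.feed(html)
    parser.close()
    return parser.properties
```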


Web Ontologies 


The main web ontology is published at 
schema.org. These schemas are used 


to annotate web pages for automated
information extraction tools.


As the published information is not 
necessarily authoritative, the data 

needs to be carefully validated for 
quality and spam removal. 


Popular schema.org entities 


Creative works: CreativeWork, Book, Movie, MusicRecording, 


Crawling - Wrap Up 


Northeastern University, College of Computer and Information Science
CS6200: Information Retrieval
Slides by: Jesse Anderton


Goals of Crawling 


A good crawler will balance several factors: 


Coverage: Pages should be selected to maximize the number of distinct 
high-quality pages. 


Freshness: Pages which have been updated should be re-crawled soon. 


Performance: Each machine should crawl thousands of pages per
second.


Politeness: Requests to the same domain are infrequent, and site
owners' requested crawler policies are respected.


Major Challenges 


High-performance data structures, such as IRLBot's DRUM, must be
used to efficiently de-duplicate URLs, manage robots.txt caches, etc.


Malicious web content should be carefully avoided, and low-quality 
content (malformed HTML, unreliable web sites, etc.) should be 
identified and dealt with as appropriate. 


Web site owners generally want their information to be crawled, so 
they provide assistance in terms of sitemaps, RSS, embedded 
metadata, etc. 


Spam Technologies


* Cloaking
  - Serve fake content to search engine robot
  - DNS cloaking: Switch IP address. Impersonate

* Doorway pages
  - Pages optimized for a single keyword that re-direct to the real ta...

* Keyword Spam
  - Misleading meta-keywords, excessive repetition of a term, fake ...
  - Hidden text with colors, CSS tricks, etc.

* Link spamming
  - Mutual admiration societies, hidden links, awards
  - Domain flooding: numerous domains that point or re-direct to a t...

* Robots
  - Fake click stream
  - Fake query stream
  - Millions of submissions via Add-Url


[Figure: Cloaking. A server asks "Is this a Search Engine spider?" before choosing which content to serve]


Meta-Keywords =
"... London hotels, hotel, holiday inn, hilton, discount, booking, reservation,
sex, mp3, britney spears, viagra, ..."


